LCORE-1497: Fix RHOAI Prow e2e pipeline failures #1613
radofuchs merged 18 commits into lightspeed-core:main from
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Build llama-stack image in OpenShift internal registry via oc new-build/start-build
- Add image-puller role for default SA to pull from internal registry
- Add FAISS_VECTOR_STORE_ID and KV_RAG_PATH env vars to lightspeed-stack pod
- Add inference, byok_rag, and rag sections to prow lightspeed-stack configs
- Use envsubst with specific variable scoping in pipeline-services.sh (see the sketch below)
- Fix free_local_tcp_port to only kill LISTEN sockets (was killing behave process)
- Add MCP token secrets and empty OpenAI secret to pipeline.sh
- Add rlsapi_v1_infer action to prow RBAC config
- Simplify llama-stack.yaml to use pre-built image

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
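For reference, here is a minimal sketch of the envsubst scoping mentioned above, assuming GNU gettext's envsubst; the file and image names are illustrative, not taken from the PR:

```bash
# Scope envsubst to a single variable by passing a SHELL-FORMAT string.
export LLAMA_STACK_IMAGE="quay.io/example/llama-stack:dev"   # illustrative image

# Replaces only ${LLAMA_STACK_IMAGE}; every other ${VAR} in the manifest is
# passed through verbatim instead of being blanked when unset.
envsubst '${LLAMA_STACK_IMAGE}' < manifest.yaml | oc apply -n "$NAMESPACE" -f -
```

A bare `envsubst < manifest.yaml`, by contrast, substitutes every `${VAR}` it finds and blanks any that are unset, which is exactly what breaks embedded scripts.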
Add DEBUG NS checkpoints to trace when e2e-rhoai-dsc namespace disappears during operator bootstrapping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The RHOAI operator deletes the e2e-rhoai-dsc namespace during DSC reconciliation. Reorder pipeline to run operator bootstrap first, then create namespace and secrets after DSC settles. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename llama-stack.yaml to llama-stack-prow.yaml and add:
- Config enrichment via llama_stack_configuration.py
- restore_rag_seed() to re-inflate RAG db after enrichment
- PYTHONPATH, lightspeed-stack.yaml mount, rag-data mount
- materialize-run-yaml init container
- Model/provider overrides in inline_rag e2e tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Create llama-stack-ip-secret before deploying the pod to fix chicken-and-egg dependency where the pod requires the secret as a non-optional env var
- Add LLAMA_STACK_CONFIG env var pointing to the correct emptyDir mount path where the materialize-run-yaml init container places run.yaml
- Use make test-e2e-local instead of test-e2e to avoid macOS-incompatible script -c flag
- Remove DEBUG NS echo lines from pipeline scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace unfiltered envsubst with sed in e2e-ops.sh restart commands to prevent blanking $VAR references in embedded bash scripts (see the sketch below)
- Add mock-jwks port-forward management (kill/restart/health check) so RBAC and MCP tests don't fail with connection refused on :8000
- Restart mock-jwks port-forward as part of lightspeed restart
- Increase vLLM max-model-len from 2048 to 32768 to avoid context length errors with RAG queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
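A hedged sketch of the sed-based substitution from the first bullet above; the placeholder and file names are illustrative. Replacing one known placeholder with sed leaves all other $VAR references, including those inside embedded bash, untouched:

```bash
# Substitute exactly one placeholder; the '|' delimiter avoids clashes with
# '/' characters in image paths. Everything else in the manifest is left as-is.
sed "s|\${LLAMA_STACK_IMAGE}|${LLAMA_STACK_IMAGE}|g" manifest.yaml | oc apply -f -
```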
verify_connectivity now checks that /v1/models returns 200 (not just /readiness) to ensure the app is fully initialized before declaring success. before_scenario in the test framework probes the port-forward before each scenario and auto-restarts it via e2e-ops if it is dead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace fragile oc delete + oc create with oc create --dry-run | oc apply so a failed update leaves the ConfigMap intact instead of deleted. The old approach caused 156 errored scenarios: if create failed after delete succeeded, the ConfigMap was gone and every subsequent update also failed. Also print stdout/stderr from e2e-ops on failure so the actual oc error is visible in test logs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
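A minimal sketch of that update pattern, with illustrative ConfigMap and file names; the client-side dry-run renders the object and `oc apply` updates it in place:

```bash
# Render the ConfigMap locally, then apply it as a create-or-update.
oc create configmap lightspeed-stack-config \
  --from-file=lightspeed-stack.yaml="$NEW_CONFIG" \
  -n "$NAMESPACE" --dry-run=client -o yaml | oc apply -f -
# If the apply fails, the previous ConfigMap is still in place, unlike the
# delete-then-create sequence, which can leave no ConfigMap at all.
```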
On Prow, both /readiness and /v1/models return 401 when auth is enabled. The previous fix only accepted 200 from /v1/models, causing connectivity checks to always fail and port-forward to be declared dead. Accept 401 as valid — it proves the full app stack is running, not just the socket. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
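A sketch of the relaxed check, under the assumption stated above that a 401 from /v1/models proves the full app stack answered; the endpoint and port mirror the discussion:

```bash
# Treat 200 (auth off) or 401 (auth on) as a healthy, fully initialized app.
http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
  http://localhost:8080/v1/models 2>/dev/null || echo 000)
if [[ "$http_code" == "200" || "$http_code" == "401" ]]; then
  echo "connectivity OK (HTTP $http_code)"
else
  echo "connectivity check failed (HTTP $http_code)" >&2
fi
```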
…tion

Llama Stack disruption tests left the pod dead after the feature because Behave clears custom context attributes between scenarios, so after_feature never saw llama_stack_was_running=True. This caused 59+ subsequent scenarios to cascade-fail with Connection refused. Three fixes:
- Store was_running in module-level state (survives Behave context resets) so after_feature reliably triggers _restore_llama_stack
- Add restart-lightspeed fallback in before_scenario when port-forward alone fails (recovers from dead pods, not just dead tunnels)
- Align pipeline.sh with pipeline-konflux.sh: export PID file paths for e2e-ops.sh, start Llama Stack port-forward on :8321, and use lsof/fuser fallback for port cleanup in minimal images

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TLS and proxy features depend on mock-tls-inference and proxy sidecars that are only deployed via Docker Compose, not in the OpenShift cluster. Every TLS scenario burned 200s waiting for a provider that never exists, consuming ~63 min of the 4h Prow timeout for guaranteed failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Walkthrough

Refactors e2e Prow manifests and orchestration: adds a new Prow-specific Llama Stack Pod with init containers for RAG/config materialization, removes the old manifest, bumps vLLM max model length, adds namespace/image handling and port-forward management in pipeline scripts, and introduces Prow-specific test gating and recovery logic.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Init as Init Container (RAG/Config)
participant CM as ConfigMap (rag-data / llama-stack-config)
participant Main as Main Container (llama-stack)
participant LCS as Lightspeed Service
participant Health as Health Probe (/v1/health)
CM->>Init: mount rag-data, llama-stack-config
Init->>Init: extract/decompress RAG KV (kv_store.db)
Init->>Init: materialize base run.yaml
Init-->>Main: signal ready
Main->>Main: prefer mounted gz seed else use extracted seed
Main->>Main: check for lightspeed-stack.yaml
alt lightspeed-stack.yaml present
Main->>Main: run llama_stack_configuration.py -> enriched run.yaml + .env
Main->>Main: source .env if present
Main->>Main: use enriched run.yaml
else
Main->>Main: fallback to original run.yaml
end
Main->>LCS: launch (llama stack run)
LCS->>Health: bind and expose /v1/health on :8321
Health-->>Main: readiness probe succeeds
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/e2e/features/steps/health.py (1)
68-95: ⚠️ Potential issue | 🟠 Major: Check the Docker running state explicitly against "true" before stopping.

Line 92 currently uses `if result.stdout.strip():`, which treats any non-empty string output as truthy. Since `docker inspect -f "{{.State.Running}}"` returns the string literal "true" or "false", the conditional incorrectly evaluates "false" as truthy, wrongly stopping a non-running container and incorrectly setting was_running = True.

Proposed fix

-    if result.stdout.strip():
+    if result.stdout.strip().lower() == "true":
         context.llama_stack_was_running = True
         _llama_stack_was_running["value"] = True
         subprocess.run(
             ["docker", "stop", "llama-stack"], check=True, capture_output=True
         )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/steps/health.py` around lines 68 - 95: The Docker running-state check uses `if result.stdout.strip():`, which treats "false" as truthy; change the condition to explicitly compare the trimmed output to "true" (e.g. result.stdout.strip().lower() == "true") when setting context.llama_stack_was_running and _llama_stack_was_running["value"], and before calling subprocess.run to stop the container, so non-running containers aren't stopped and was_running is set correctly.

tests/e2e-prow/rhoai/pipeline.sh (1)
390-421: ⚠️ Potential issue | 🟠 Major: Don't require an unauthenticated 200 from /v1/models.

In auth-enabled Prow, Lightspeed is healthy when /v1/models returns 401, but this loop uses `curl -f` and will spin for three minutes before failing anyway. Please align the initial wait with verify_connectivity() in tests/e2e-prow/rhoai/scripts/e2e-ops.sh by accepting 200 or 401 after the forward comes up, or by probing /readiness first.

Suggested fix

 for i in $(seq 1 36); do
-  if curl -sf http://localhost:8080/v1/models > /dev/null 2>&1; then
+  http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
+    http://localhost:8080/v1/models 2>/dev/null || echo 000)
+  if [[ "$http_code" == "200" || "$http_code" == "401" ]]; then
     echo "✅ Port-forward ready after $(( i * 5 ))s"
     break
   fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/e2e-prow/rhoai/pipeline.sh` around lines 390 - 421: The readiness loop currently requires an unauthenticated 200 from /v1/models; update the check in the loop that uses PF_LCS_PID to accept either HTTP 200 or 401 (unauthorized) as success (or alternatively probe /readiness first) so auth-enabled clusters don't wait the full timeout; modify the curl invocation logic to capture the HTTP status and treat 200 or 401 as success, keeping the existing restart-on-death behavior that inspects PF_LCS_PID and preserving the debug/logging and kill/exit flow on final timeout to mirror verify_connectivity() behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml`:
- Around line 33-41: The init script is seeding under /data/src/.llama/storage
which ends up nested in the main container; instead seed the shared mount root
(/data) so the main container's KV_RAG_PATH sees the files. Update the mkdir,
gunzip/cp and chmod invocations to create and write to /data/.llama/storage/rag
and /data/.llama/storage/files (not /data/src/...), e.g. ensure the script uses
/rag-data/kv_store.db.gz -> /data/.e2e-rag-seed/kv_store.db and then copies that
file to /data/.llama/storage/rag/kv_store.db, and adjust the final chmod to
operate on /data and /data/.e2e-rag-seed so the prepopulated path matches
KV_RAG_PATH.
In `@tests/e2e-prow/rhoai/pipeline-services.sh`:
- Around line 4-30: Add fail-fast behavior by inserting set -euo pipefail near
the top (immediately after the NAMESPACE assignment) so any failing command
(envsubst | oc apply, oc wait pod/llama-stack-service, oc expose pod
llama-stack-service, etc.) aborts the script; also stop swallowing errors on
secret creation—replace the silent redirection and echo around oc create secret
llama-stack-ip-secret with a check that only ignores "already exists" (e.g., try
oc create and on failure verify oc get secret llama-stack-ip-secret returns
success, otherwise exit) so unexpected failures surface immediately.
In `@tests/e2e-prow/rhoai/pipeline.sh`:
- Around line 359-369: The port-cleanup block only tries lsof and fuser and
needs the /proc fallback used in tests/e2e-prow/rhoai/scripts/e2e-ops.sh to
catch stale oc port-forward listeners; update the cleanup in pipeline.sh (the
block that checks command -v lsof / command -v fuser and kills ports 8080, 8000,
8321) to add the /proc-based scan as the final elif/else branch that iterates
/proc/*/fd, resolves socket links to find LISTEN sockets bound to those ports
and kills the owning PIDs, mirroring the logic from e2e-ops.sh so binds won’t
fail when lsof/fuser are absent.
- Around line 377-388: The JWKS port-forward started into PF_JWKS_PID (and
persisted to E2E_JWKS_PORT_FORWARD_PID_FILE) is not validated before tests run;
add a readiness check after spawning oc port-forward svc/mock-jwks by polling
localhost:8000/tokens (or invoking the existing restart-jwks-port-forward
helper) until it returns success or a timeout, and only proceed once /tokens is
reachable; ensure the health loop uses PF_JWKS_PID to detect if the background
process died and call restart-jwks-port-forward if needed so the RBAC tests
won't start against an unavailable JWKS endpoint.
In `@tests/e2e-prow/rhoai/scripts/bootstrap.sh`:
- Around line 95-96: The script contains two consecutive sleep commands ("sleep
5" followed by "sleep 10") which should be collapsed into a single sleep to
improve clarity; replace the two lines with one "sleep 15" in the bootstrap.sh
script so the total delay is preserved while simplifying the code.
In `@tests/e2e-prow/rhoai/scripts/e2e-ops.sh`:
- Around line 505-544: The health checks treat any non-"000" curl result as
healthy; change the logic that examines the curl http_code (the two occurrences
that compare http_code != "000" when checking
"http://127.0.0.1:$local_port/tokens") to require a successful response (e.g.
http_code == "200" or at minimum a 2xx check) before claiming the JWKS
port-forward is healthy and before writing pf_pid to
E2E_JWKS_PORT_FORWARD_PID_FILE; adjust the checks near the variables
E2E_JWKS_PORT_FORWARD_PID_FILE, jwks_pf_log, pf_pid and the curl invocations so
only a true success (200/2xx) allows early return or saving the PID.
In `@tests/e2e/utils/prow_utils.py`:
- Around line 208-209: Replace the two print calls used for update-configmap
diagnostics with the module logger: add "logger = get_logger(__name__)"
(importing get_logger from log.py) near the top of prow_utils.py, then change
the prints that reference result.stdout and result.stderr to structured logger
calls (e.g., logger.debug or logger.info with format placeholders) so the
messages use the module logger and pass result.stdout/result.stderr as
parameters instead of using print.
---
Outside diff comments:
In `@tests/e2e-prow/rhoai/pipeline.sh`:
- Around line 390-421: The readiness loop currently requires an unauthenticated
200 from /v1/models; update the check in the loop that uses PF_LCS_PID to accept
either HTTP 200 or 401 (unauthorized) as success (or alternatively probe
/readiness first) so auth-enabled clusters don't wait the full timeout; modify
the curl invocation logic to capture HTTP status and treat status 200 or 401 as
success, keeping the existing restart-on-death behavior that inspects PF_LCS_PID
and preserving the debug/logging and kill/exit flow on final timeout to mirror
verify_connectivity() behavior.
In `@tests/e2e/features/steps/health.py`:
- Around line 68-95: The Docker running-state check uses if
result.stdout.strip(): which treats "false" as truthy; change the condition to
explicitly compare the trimmed output to "true" (e.g.
result.stdout.strip().lower() == "true") when setting
context.llama_stack_was_running and _llama_stack_was_running["value"] and before
calling subprocess.run to stop the container so non-running containers aren't
stopped and was_running is set correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c48d2bc8-9576-4364-947c-545202f5ad32
📒 Files selected for processing (15)
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
- tests/e2e-prow/rhoai/pipeline-services.sh
- tests/e2e-prow/rhoai/pipeline.sh
- tests/e2e-prow/rhoai/scripts/bootstrap.sh
- tests/e2e-prow/rhoai/scripts/e2e-ops.sh
- tests/e2e/features/environment.py
- tests/e2e/features/proxy.feature
- tests/e2e/features/steps/health.py
- tests/e2e/features/tls.feature
- tests/e2e/utils/prow_utils.py
💤 Files with no reviewable changes (2)
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: build-pr
- GitHub Check: E2E: server mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E: library mode / ci / group 2
🧰 Additional context used
📓 Path-based instructions (2)
tests/e2e/**/*.{py,feature}
📄 CodeRabbit inference engine (AGENTS.md)
Use behave (BDD) framework for end-to-end testing
Files:
tests/e2e/features/tls.feature, tests/e2e/features/proxy.feature, tests/e2e/utils/prow_utils.py, tests/e2e/features/steps/health.py, tests/e2e/features/environment.py
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py:
- Use absolute imports for internal modules: `from authentication import get_auth_dependency`
- Import FastAPI dependencies with: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
- Import Llama Stack client with: `from llama_stack_client import AsyncLlamaStackClient`
- Check constants.py for shared constants before defining new ones
- All modules start with descriptive docstrings explaining purpose
- Use `logger = get_logger(__name__)` from log.py for module logging
- Type aliases defined at module level for clarity
- Use Final[type] as type hint for all constants
- All functions require docstrings with brief descriptions
- Complete type annotations for parameters and return types in functions
- Use `typing_extensions.Self` for model validators in Pydantic models
- Use modern union type syntax `str | int` instead of `Union[str, int]`
- Use `Optional[Type]` for optional type hints
- Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
- Avoid in-place parameter modification anti-patterns; return new data structures instead
- Use `async def` for I/O operations and external API calls
- Handle `APIConnectionError` from Llama Stack in error handling
- Use standard log levels with clear purposes: debug, info, warning, error
- All classes require descriptive docstrings explaining purpose
- Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
- Use ABC for abstract base classes with `@abstractmethod` decorators
- Use `@model_validator` and `@field_validator` for Pydantic model validation
- Complete type annotations for all class attributes; use specific types, not `Any`
- Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections
Files:
tests/e2e/utils/prow_utils.py, tests/e2e/features/steps/health.py, tests/e2e/features/environment.py
🧠 Learnings (8)
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.
Applied to files:
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml, tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml, tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml, tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
📚 Learning: 2025-09-02T11:09:40.404Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 485
File: tests/e2e/features/environment.py:87-95
Timestamp: 2025-09-02T11:09:40.404Z
Learning: In the lightspeed-stack e2e tests, noop authentication tests use the default lightspeed-stack.yaml configuration, while noop-with-token tests use the Authorized tag to trigger a config swap to the specialized noop-with-token configuration file.
Applied to files:
tests/e2e/features/proxy.feature
📚 Learning: 2025-09-02T11:15:02.411Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 485
File: tests/e2e/test_list.txt:2-3
Timestamp: 2025-09-02T11:15:02.411Z
Learning: In the lightspeed-stack e2e tests, the Authorized tag is intentionally omitted from noop authentication tests because they are designed to test against the default lightspeed-stack.yaml configuration rather than the specialized noop-with-token configuration.
Applied to files:
tests/e2e/features/proxy.feature
📚 Learning: 2026-02-19T10:06:58.708Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:58.708Z
Learning: In tests/e2e-prow/rhoai/, ConfigMaps like mock-jwks-script and mcp-mock-server-script are created dynamically by the pipeline.sh deployment script using `oc create configmap` commands, rather than being defined as static ConfigMap resources in the manifest YAML files.
Applied to files:
tests/e2e/utils/prow_utils.py, tests/e2e-prow/rhoai/scripts/bootstrap.sh, tests/e2e-prow/rhoai/pipeline-services.sh, tests/e2e-prow/rhoai/scripts/e2e-ops.sh, tests/e2e-prow/rhoai/pipeline.sh
📚 Learning: 2026-04-13T13:39:54.963Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1490
File: tests/e2e/features/environment.py:206-211
Timestamp: 2026-04-13T13:39:54.963Z
Learning: In lightspeed-stack E2E tests under tests/e2e/features, it is intentional to set context.feature_config inside Background/step functions (scenario-scoped Behave layer). The environment.py after_scenario restore logic should only restore configuration when context.scenario_lightspeed_override_active is True; this flag is set by configure_service only when a real config switch occurs (so restore does not run for scenarios without a switch). Additionally, steps/common.py’s module-level _active_lightspeed_stack_config_basename is used to prevent re-applying the same config across subsequent scenarios, ensuring scenario_lightspeed_override_active stays False after the first apply. Therefore, reviewers should not “fix” this flow as if feature_config were incorrectly scoped or if after_scenario restoration is missing—config switching and restoration are meant to happen exactly once per actual switch, not redundantly per scenario.
Applied to files:
tests/e2e/features/steps/health.py, tests/e2e/features/environment.py
📚 Learning: 2026-04-07T09:20:26.590Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1467
File: tests/e2e/features/steps/common.py:36-49
Timestamp: 2026-04-07T09:20:26.590Z
Learning: For Behave-based Python tests, rely on Behave’s Context layered stack for attribute lifecycle: Behave pushes a new Context layer when entering feature scope (before_feature) and again for scenario scope (before_scenario). Attributes assigned inside given/when/then steps live on the current scenario layer and are automatically removed when the scenario ends. As a result, step-set attributes should not be expected to persist across scenarios or features, and manual cleanup in after_scenario/after_feature is generally unnecessary for attributes set in step functions. Only perform manual cleanup for attributes that you set explicitly in before_feature/before_scenario, since those live on the respective feature/scenario layers.
Applied to files:
tests/e2e/features/steps/health.py, tests/e2e/features/environment.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to tests/e2e/**/*.{py,feature} : Use behave (BDD) framework for end-to-end testing
Applied to files:
tests/e2e/features/environment.py
📚 Learning: 2025-08-18T10:56:55.349Z
Learnt from: matysek
Repo: lightspeed-core/lightspeed-stack PR: 292
File: pyproject.toml:0-0
Timestamp: 2025-08-18T10:56:55.349Z
Learning: The lightspeed-stack project intentionally uses a "generic image" approach, bundling many dependencies directly in the base runtime image to work for everyone, rather than using lean base images with optional dependency groups.
Applied to files:
tests/e2e-prow/rhoai/pipeline.sh
🪛 Checkov (3.2.525)
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
[low] 7-205: CPU limits should be set
(CKV_K8S_11)
[low] 7-205: CPU requests should be set
(CKV_K8S_10)
[low] 7-205: The default namespace should not be used
(CKV_K8S_21)
[low] 7-205: Image should use digest
(CKV_K8S_43)
[low] 7-205: Image Tag should be fixed - not latest or blank
(CKV_K8S_14)
[low] 7-205: Memory limits should be set
(CKV_K8S_13)
[low] 7-205: Memory requests should be set
(CKV_K8S_12)
[low] 7-205: Use read-only filesystem for containers where possible
(CKV_K8S_22)
[low] 7-205: Containers should run as a high UID to avoid host conflict
(CKV_K8S_40)
[low] 7-205: Prefer using secrets as files over secrets as environment variables
(CKV_K8S_35)
[low] 7-205: Ensure that Service Account Tokens are only mounted where necessary
(CKV_K8S_38)
🪛 Trivy (0.69.3)
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
[info] 72-193: CPU not limited
Container 'llama-stack-container' of Pod 'llama-stack-service' should set 'resources.limits.cpu'
Rule: KSV-0011
(IaC/Kubernetes)
[info] 48-70: CPU not limited
Container 'materialize-run-yaml' of Pod 'llama-stack-service' should set 'resources.limits.cpu'
Rule: KSV-0011
(IaC/Kubernetes)
[info] 18-47: CPU not limited
Container 'setup-rag-data' of Pod 'llama-stack-service' should set 'resources.limits.cpu'
Rule: KSV-0011
(IaC/Kubernetes)
[warning] 72-193: Image tag ":latest" used
Container 'llama-stack-container' of Pod 'llama-stack-service' should specify an image tag
Rule: KSV-0013
(IaC/Kubernetes)
[warning] 48-70: Image tag ":latest" used
Container 'materialize-run-yaml' of Pod 'llama-stack-service' should specify an image tag
Rule: KSV-0013
(IaC/Kubernetes)
[warning] 18-47: Image tag ":latest" used
Container 'setup-rag-data' of Pod 'llama-stack-service' should specify an image tag
Rule: KSV-0013
(IaC/Kubernetes)
[error] 72-193: Root file system is not read-only
Container 'llama-stack-container' of Pod 'llama-stack-service' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[error] 48-70: Root file system is not read-only
Container 'materialize-run-yaml' of Pod 'llama-stack-service' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[error] 18-47: Root file system is not read-only
Container 'setup-rag-data' of Pod 'llama-stack-service' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[info] 72-193: CPU requests not specified
Container 'llama-stack-container' of Pod 'llama-stack-service' should set 'resources.requests.cpu'
Rule: KSV-0015
(IaC/Kubernetes)
[info] 48-70: CPU requests not specified
Container 'materialize-run-yaml' of Pod 'llama-stack-service' should set 'resources.requests.cpu'
Rule: KSV-0015
(IaC/Kubernetes)
[info] 18-47: CPU requests not specified
Container 'setup-rag-data' of Pod 'llama-stack-service' should set 'resources.requests.cpu'
Rule: KSV-0015
(IaC/Kubernetes)
[info] 72-193: Memory requests not specified
Container 'llama-stack-container' of Pod 'llama-stack-service' should set 'resources.requests.memory'
Rule: KSV-0016
(IaC/Kubernetes)
[info] 48-70: Memory requests not specified
Container 'materialize-run-yaml' of Pod 'llama-stack-service' should set 'resources.requests.memory'
Rule: KSV-0016
(IaC/Kubernetes)
[info] 18-47: Memory requests not specified
Container 'setup-rag-data' of Pod 'llama-stack-service' should set 'resources.requests.memory'
Rule: KSV-0016
(IaC/Kubernetes)
[info] 72-193: Memory not limited
Container 'llama-stack-container' of Pod 'llama-stack-service' should set 'resources.limits.memory'
Rule: KSV-0018
(IaC/Kubernetes)
[info] 48-70: Memory not limited
Container 'materialize-run-yaml' of Pod 'llama-stack-service' should set 'resources.limits.memory'
Rule: KSV-0018
(IaC/Kubernetes)
[info] 18-47: Memory not limited
Container 'setup-rag-data' of Pod 'llama-stack-service' should set 'resources.limits.memory'
Rule: KSV-0018
(IaC/Kubernetes)
[info] 72-193: Runs with UID <= 10000
Container 'llama-stack-container' of Pod 'llama-stack-service' should set 'securityContext.runAsUser' > 10000
Rule: KSV-0020
(IaC/Kubernetes)
[info] 72-193: Runs with GID <= 10000
Container 'llama-stack-container' of Pod 'llama-stack-service' should set 'securityContext.runAsGroup' > 10000
Rule: KSV-0021
(IaC/Kubernetes)
[info] 48-70: Runs with GID <= 10000
Container 'materialize-run-yaml' of Pod 'llama-stack-service' should set 'securityContext.runAsGroup' > 10000
Rule: KSV-0021
(IaC/Kubernetes)
[info] 18-47: Runs with GID <= 10000
Container 'setup-rag-data' of Pod 'llama-stack-service' should set 'securityContext.runAsGroup' > 10000
Rule: KSV-0021
(IaC/Kubernetes)
[info] 9-12: Workloads in the default namespace
pod llama-stack-service in default namespace should set metadata.namespace to a non-default namespace
Rule: KSV-0110
(IaC/Kubernetes)
🔇 Additional comments (5)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)
31-32: KV RAG path wiring looks good. The new KV_RAG_PATH env var is correctly scoped to the Lightspeed container and matches the intended RAG path usage.

tests/e2e/features/proxy.feature (1)

1-1: Prow skip tagging is correctly applied. Adding @skip-in-prow at feature scope is appropriate for scenarios requiring services not present in Prow.

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml (1)

27-27: The max-model-len update is consistent with the runtime adjustment. No issues in this changed segment.

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)

27-27: CPU runtime max length update is aligned with related manifest changes. No issues in this changed segment.

tests/e2e/features/tls.feature (1)

1-1: Feature-level Prow gating is correctly added. The new @skip-in-prow tag cleanly excludes TLS scenarios in environments that lack required compose services.
mkdir -p /data/src/.llama/storage/rag /data/src/.llama/storage/files /data/.e2e-rag-seed
if [ ! -f /rag-data/kv_store.db.gz ]; then
  echo "FATAL: missing /rag-data/kv_store.db.gz"
  ls -la /rag-data || true
  exit 1
fi
gunzip -c /rag-data/kv_store.db.gz > /data/.e2e-rag-seed/kv_store.db
cp -f /data/.e2e-rag-seed/kv_store.db /data/src/.llama/storage/rag/kv_store.db
chmod -R 777 /data/src /data/.e2e-rag-seed
Seed the shared volume at its mounted root, not under /data/src/.llama/storage.
rag-storage is mounted as /data in this init container but as /opt/app-root/src/.llama/storage in the main container. Writing to /data/src/.llama/storage/... therefore lands at /opt/app-root/src/.llama/storage/src/.llama/storage/..., so the init container never prepopulates the path that KV_RAG_PATH points to.
Suggested fix
- mkdir -p /data/src/.llama/storage/rag /data/src/.llama/storage/files /data/.e2e-rag-seed
+ mkdir -p /data/rag /data/files /data/.e2e-rag-seed
...
- cp -f /data/.e2e-rag-seed/kv_store.db /data/src/.llama/storage/rag/kv_store.db
- chmod -R 777 /data/src /data/.e2e-rag-seed
+ cp -f /data/.e2e-rag-seed/kv_store.db /data/rag/kv_store.db
+ chmod -R 777 /data/rag /data/files /data/.e2e-rag-seed
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml` around lines
33 - 41, The init script is seeding under /data/src/.llama/storage which ends up
nested in the main container; instead seed the shared mount root (/data) so the
main container's KV_RAG_PATH sees the files. Update the mkdir, gunzip/cp and
chmod invocations to create and write to /data/.llama/storage/rag and
/data/.llama/storage/files (not /data/src/...), e.g. ensure the script uses
/rag-data/kv_store.db.gz -> /data/.e2e-rag-seed/kv_store.db and then copies that
file to /data/.llama/storage/rag/kv_store.db, and adjust the final chmod to
operate on /data and /data/.e2e-rag-seed so the prepopulated path matches
KV_RAG_PATH.
 NAMESPACE="${NAMESPACE:-e2e-rhoai-dsc}"

-# Deploy llama-stack
-envsubst < "$BASE_DIR/manifests/lightspeed/llama-stack.yaml" | oc apply -f -
+# Create llama-stack-ip-secret before deploying the pod (it references the secret as an env var)
+export E2E_LLAMA_HOSTNAME="llama-stack-service-svc.${NAMESPACE}.svc.cluster.local"
+oc create secret generic llama-stack-ip-secret \
+  --from-literal=key="$E2E_LLAMA_HOSTNAME" \
+  -n "$NAMESPACE" 2>/dev/null || echo "Secret llama-stack-ip-secret exists"
+
+# Deploy llama-stack (substitute only LLAMA_STACK_IMAGE, leave other ${} intact)
+envsubst '${LLAMA_STACK_IMAGE}' < "$BASE_DIR/manifests/lightspeed/llama-stack-prow.yaml" | oc apply -n "$NAMESPACE" -f -

 oc wait pod/llama-stack-service \
-  -n e2e-rhoai-dsc --for=condition=Ready --timeout=600s
+  -n "$NAMESPACE" --for=condition=Ready --timeout=600s

-# Get url address of llama-stack pod
-oc label pod llama-stack-service pod=llama-stack-service -n e2e-rhoai-dsc
+# Expose llama-stack service
+oc label pod llama-stack-service pod=llama-stack-service -n "$NAMESPACE"

 oc expose pod llama-stack-service \
   --name=llama-stack-service-svc \
   --port=8321 \
   --type=ClusterIP \
-  -n e2e-rhoai-dsc
-
-export E2E_LLAMA_HOSTNAME="llama-stack-service-svc.e2e-rhoai-dsc.svc.cluster.local"
-
-oc create secret generic llama-stack-ip-secret \
-  --from-literal=key="$E2E_LLAMA_HOSTNAME" \
-  -n e2e-rhoai-dsc || echo "Secret exists"
+  -n "$NAMESPACE"

-# Deploy lightspeed-stack
-oc apply -f "$BASE_DIR/manifests/lightspeed/lightspeed-stack.yaml"
+# Deploy lightspeed-stack (substitute only LIGHTSPEED_STACK_IMAGE, leave other ${} intact)
+LIGHTSPEED_STACK_IMAGE="${LIGHTSPEED_STACK_IMAGE:-quay.io/lightspeed-core/lightspeed-stack:dev-latest}"
+export LIGHTSPEED_STACK_IMAGE
+envsubst '${LIGHTSPEED_STACK_IMAGE}' < "$BASE_DIR/manifests/lightspeed/lightspeed-stack.yaml" | oc apply -n "$NAMESPACE" -f -
Make this script fail fast on deployment errors.
This file still runs without set -euo pipefail, so a timed-out oc wait, failed oc expose, or failed manifest apply can be overwritten by a later successful command and pipeline.sh will continue with a half-deployed environment.
Suggested fix
#!/bin/bash
+set -euo pipefail
BASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
NAMESPACE="${NAMESPACE:-e2e-rhoai-dsc}"
+: "${LLAMA_STACK_IMAGE:?LLAMA_STACK_IMAGE must be set}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e-prow/rhoai/pipeline-services.sh` around lines 4 - 30, Add fail-fast
behavior by inserting set -euo pipefail near the top (immediately after the
NAMESPACE assignment) so any failing command (envsubst | oc apply, oc wait
pod/llama-stack-service, oc expose pod llama-stack-service, etc.) aborts the
script; also stop swallowing errors on secret creation—replace the silent
redirection and echo around oc create secret llama-stack-ip-secret with a check
that only ignores "already exists" (e.g., try oc create and on failure verify oc
get secret llama-stack-ip-secret returns success, otherwise exit) so unexpected
failures surface immediately.
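A sketch combining both suggestions, a fail-fast preamble plus secret creation that tolerates only the already-exists case; the names mirror the script under review, but the exact placement is the author's call:

```bash
set -euo pipefail   # abort on any failing command, unset variable, or pipe failure

if ! oc create secret generic llama-stack-ip-secret \
    --from-literal=key="$E2E_LLAMA_HOSTNAME" -n "$NAMESPACE" 2>/dev/null; then
  # Creation failed: acceptable only if the secret is already present.
  oc get secret llama-stack-ip-secret -n "$NAMESPACE" >/dev/null \
    || { echo "FATAL: could not create llama-stack-ip-secret" >&2; exit 1; }
  echo "Secret llama-stack-ip-secret already exists"
fi
```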
# Kill any existing processes on ports 8080, 8000, and 8321 (lsof may be missing in minimal images)
echo "Checking for existing processes on ports 8080, 8000, and 8321..."
if command -v lsof >/dev/null 2>&1; then
  lsof -ti:8080 | xargs kill -9 2>/dev/null || true
  lsof -ti:8000 | xargs kill -9 2>/dev/null || true
  lsof -ti:8321 | xargs kill -9 2>/dev/null || true
elif command -v fuser >/dev/null 2>&1; then
  fuser -k 8080/tcp 2>/dev/null || true
  fuser -k 8000/tcp 2>/dev/null || true
  fuser -k 8321/tcp 2>/dev/null || true
fi
Keep the /proc fallback here too.
This cleanup only uses lsof/fuser, but the comment already calls out minimal images. If neither tool exists, stale oc port-forward listeners survive and the new binds on 8080, 8000, or 8321 can keep failing with “address already in use”. Reuse the /proc-based cleanup from tests/e2e-prow/rhoai/scripts/e2e-ops.sh here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e-prow/rhoai/pipeline.sh` around lines 359 - 369, The port-cleanup
block only tries lsof and fuser and needs the /proc fallback used in
tests/e2e-prow/rhoai/scripts/e2e-ops.sh to catch stale oc port-forward
listeners; update the cleanup in pipeline.sh (the block that checks command -v
lsof / command -v fuser and kills ports 8080, 8000, 8321) to add the /proc-based
scan as the final elif/else branch that iterates /proc/*/fd, resolves socket
links to find LISTEN sockets bound to those ports and kills the owning PIDs,
mirroring the logic from e2e-ops.sh so binds won’t fail when lsof/fuser are
absent.
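For illustration, one way such a /proc-based fallback could look; this is a hypothetical sketch (the helper name is invented, IPv4 only), not the actual e2e-ops.sh implementation:

```bash
# Find the owner of a LISTEN socket using only /proc, then kill it.
kill_port_proc() {
  local port=$1 hexport inode fd pid
  hexport=$(printf '%04X' "$port")
  # /proc/net/tcp: field 2 = local address (hex), field 4 = state
  # (0A means LISTEN), field 10 = socket inode.
  inode=$(awk -v p=":$hexport" '$2 ~ p"$" && $4 == "0A" {print $10}' /proc/net/tcp | head -n1)
  [[ -n "$inode" ]] || return 0
  for fd in /proc/[0-9]*/fd/*; do
    if [[ "$(readlink "$fd" 2>/dev/null)" == "socket:[$inode]" ]]; then
      pid=${fd#/proc/}; pid=${pid%%/*}
      kill -9 "$pid" 2>/dev/null || true
    fi
  done
}

for p in 8080 8000 8321; do kill_port_proc "$p"; done
```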
# Start port-forward for mock-jwks (needed for RBAC tests to get tokens)
echo "Starting port-forward for mock-jwks..."
oc port-forward svc/mock-jwks 8000:8000 -n $NAMESPACE &
PF_JWKS_PID=$!
echo "$PF_JWKS_PID" >"$E2E_JWKS_PORT_FORWARD_PID_FILE"

# Behave steps that call Llama Stack directly (MCP toolgroups, shields, disrupt/restore)
# need localhost:8321. Without this forward those tests hit "Connection refused".
echo "Starting port-forward for llama-stack..."
oc port-forward svc/llama-stack-service-svc 8321:8321 -n $NAMESPACE &
PF_LLAMA_PID=$!
echo "$PF_LLAMA_PID" >"$E2E_LLAMA_PORT_FORWARD_PID_FILE"
Validate the initial JWKS forward before starting tests.
Unlike the Lightspeed and Llama forwards, this one is never checked after startup. If oc port-forward svc/mock-jwks exits immediately or /tokens is not reachable yet, the first RBAC scenario still flakes before e2e-ops.sh gets a chance to repair it. Add a /tokens wait loop here or call restart-jwks-port-forward immediately after spawning the forward.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e-prow/rhoai/pipeline.sh` around lines 377 - 388, The JWKS
port-forward started into PF_JWKS_PID (and persisted to
E2E_JWKS_PORT_FORWARD_PID_FILE) is not validated before tests run; add a
readiness check after spawning oc port-forward svc/mock-jwks by polling
localhost:8000/tokens (or invoking the existing restart-jwks-port-forward
helper) until it returns success or a timeout, and only proceed once /tokens is
reachable; ensure the health loop uses PF_JWKS_PID to detect if the background
process died and call restart-jwks-port-forward if needed so the RBAC tests
won't start against an unavailable JWKS endpoint.
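A minimal sketch of that readiness wait, reusing the service name and /tokens endpoint from the discussion:

```bash
oc port-forward svc/mock-jwks 8000:8000 -n "$NAMESPACE" &
PF_JWKS_PID=$!

for _ in $(seq 1 30); do
  # If the background forward died, respawn it before probing again.
  if ! kill -0 "$PF_JWKS_PID" 2>/dev/null; then
    oc port-forward svc/mock-jwks 8000:8000 -n "$NAMESPACE" &
    PF_JWKS_PID=$!
  fi
  if curl -sf --max-time 3 http://localhost:8000/tokens >/dev/null; then
    echo "mock-jwks is ready"
    break
  fi
  sleep 2
done
```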
sleep 5
sleep 10
🧹 Nitpick | 🔵 Trivial
Collapse consecutive sleeps into one for clarity.
Line 95 and Line 96 are equivalent to a single sleep 15, which is easier to read and maintain.
Proposed cleanup
-sleep 5
-sleep 10
+sleep 15
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e-prow/rhoai/scripts/bootstrap.sh` around lines 95 - 96, The script
contains two consecutive sleep commands ("sleep 5" followed by "sleep 10") which
should be collapsed into a single sleep to improve clarity; replace the two
lines with one "sleep 15" in the bootstrap.sh script so the total delay is
preserved while simplifying the code.
@are-ces, sure! Here's the plan:
- Collapse the two consecutive `sleep` calls into a single `sleep 15` in bootstrap.sh
oc apply -f "$BASE_DIR/manifests/operators/ds-cluster.yaml"
-sleep 5
-sleep 10
+sleep 15
---
🧠 Learnings used
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:58.708Z
Learning: In tests/e2e-prow/rhoai/, ConfigMaps like mock-jwks-script and mcp-mock-server-script are created dynamically by the pipeline.sh deployment script using oc create configmap commands, rather than being defined as static ConfigMap resources in the manifest YAML files.
---
- [ ] 📌 Commit the changes to this branch
| if [[ -f "$E2E_JWKS_PORT_FORWARD_PID_FILE" ]]; then | ||
| local saved_pf | ||
| read -r saved_pf <"$E2E_JWKS_PORT_FORWARD_PID_FILE" 2>/dev/null || true | ||
| if [[ "$saved_pf" =~ ^[0-9]+$ ]] && kill -0 "$saved_pf" 2>/dev/null; then | ||
| local http_code | ||
| http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "http://127.0.0.1:$local_port/tokens" 2>/dev/null) || http_code="000" | ||
| if [[ "$http_code" != "000" ]]; then | ||
| echo "✓ Mock JWKS port-forward already healthy (PID: $saved_pf)" | ||
| return 0 | ||
| fi | ||
| fi | ||
| fi | ||
|
|
||
| echo "Re-establishing mock-jwks port-forward on $local_port:$remote_port..." | ||
|
|
||
| for ((attempt=1; attempt<=max_attempts; attempt++)); do | ||
| kill_stale_jwks_forward "$local_port" | ||
| sleep 2 | ||
|
|
||
| echo "JWKS port-forward attempt $attempt/$max_attempts" | ||
|
|
||
| : >"$jwks_pf_log" | ||
| nohup oc port-forward svc/mock-jwks "$local_port:$remote_port" -n "$NAMESPACE" \ | ||
| </dev/null >"$jwks_pf_log" 2>&1 & | ||
| pf_pid=$! | ||
| disown "$pf_pid" 2>/dev/null || true | ||
| sleep 3 | ||
|
|
||
| if ! kill -0 "$pf_pid" 2>/dev/null; then | ||
| echo "JWKS port-forward process exited immediately" | ||
| continue | ||
| fi | ||
|
|
||
| local http_code | ||
| http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$local_port/tokens" 2>/dev/null) || http_code="000" | ||
| if [[ "$http_code" != "000" ]]; then | ||
| echo "$pf_pid" >"$E2E_JWKS_PORT_FORWARD_PID_FILE" | ||
| echo "✓ Mock JWKS port-forward established (PID: $pf_pid)" | ||
| return 0 | ||
| fi |
Require /tokens to succeed, not merely respond.
Both checks here treat any non-000 HTTP code as “healthy”. That lets a 404 or 500 on mock-jwks pass restart logic even though RBAC steps still cannot mint tokens.
Suggested fix
- if [[ "$http_code" != "000" ]]; then
+ if [[ "$http_code" == "200" ]]; then
echo "✓ Mock JWKS port-forward already healthy (PID: $saved_pf)"
return 0
fi
...
- if [[ "$http_code" != "000" ]]; then
+ if [[ "$http_code" == "200" ]]; then
echo "$pf_pid" >"$E2E_JWKS_PORT_FORWARD_PID_FILE"
echo "✓ Mock JWKS port-forward established (PID: $pf_pid)"
return 0
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e-prow/rhoai/scripts/e2e-ops.sh` around lines 505 - 544, The health
checks treat any non-"000" curl result as healthy; change the logic that
examines the curl http_code (the two occurrences that compare http_code != "000"
when checking "http://127.0.0.1:$local_port/tokens") to require a successful
response (e.g. http_code == "200" or at minimum a 2xx check) before claiming the
JWKS port-forward is healthy and before writing pf_pid to
E2E_JWKS_PORT_FORWARD_PID_FILE; adjust the checks near the variables
E2E_JWKS_PORT_FORWARD_PID_FILE, jwks_pf_log, pf_pid and the curl invocations so
only a true success (200/2xx) allows early return or saving the PID.
| print(f"update-configmap stdout: {result.stdout}") | ||
| print(f"update-configmap stderr: {result.stderr}") |
🛠️ Refactor suggestion | 🟠 Major
Use module logger instead of print for new failure diagnostics.
Line 208 and Line 209 should use structured logging to stay consistent with repo logging standards.
Proposed change
+from log import get_logger
+
+logger = get_logger(__name__)
@@
- print(f"update-configmap stdout: {result.stdout}")
- print(f"update-configmap stderr: {result.stderr}")
+ logger.error("update-configmap stdout: %s", result.stdout)
+ logger.error("update-configmap stderr: %s", result.stderr)As per coding guidelines, "Use logger = get_logger(__name__) from log.py for module logging".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e/utils/prow_utils.py` around lines 208 - 209, Replace the two print
calls used for update-configmap diagnostics with the module logger: add "logger
= get_logger(__name__)" (importing get_logger from log.py) near the top of
prow_utils.py, then change the prints that reference result.stdout and
result.stderr to structured logger calls (e.g., logger.debug or logger.info with
format placeholders) so the messages use the module logger and pass
result.stdout/result.stderr as parameters instead of using print.
@radofuchs we should check whether the Konflux e2e tests run, there are large updates to |
Behave clears custom context attributes between scenarios, so hostname_llama/port_llama are gone by the time after_feature runs. Store them in module-level state (same pattern as llama_stack_was_running). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/features/environment.py`:
- Around line 446-451: The restore logic now runs unconditionally when
get_llama_stack_was_running() is true, which can attempt to restore a
llama-stack in library-mode runs; update the condition to also check that we're
not in library mode by reintroducing the context.is_library_mode guard (i.e.,
only call _restore_llama_stack() and reset_llama_stack_was_running() when
get_llama_stack_was_running() is true AND context.is_library_mode is False),
locating the check around get_llama_stack_was_running(), _restore_llama_stack,
and reset_llama_stack_was_running to ensure teardown skips llama-stack
restoration in library-mode scenarios.
- Around line 159-183: The hook currently logs failures from both
run_e2e_ops("restart-port-forward") and run_e2e_ops("restart-lightspeed") but
continues; change before_scenario so that after both recovery attempts either
re-probe the service readiness (e.g., call the existing readiness check or
run_e2e_ops("check-readiness")/similar) and if that probe still fails or any
restart timed out/returned non-zero, raise an exception (or call sys.exit) to
hard-fail the scenario; update the blocks around run_e2e_ops to capture
timeouts/non-zero return codes and, if recovery did not succeed, raise a
RuntimeError with a clear message so the scenario stops immediately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a55c71f3-961c-4ae1-b1d9-058d815b1cba
📒 Files selected for processing (2)
tests/e2e/features/environment.py, tests/e2e/features/steps/common.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: build-pr
- GitHub Check: radon
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E: library mode / ci / group 2
- GitHub Check: E2E Tests for Lightspeed Evaluation job
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py:
- Use absolute imports for internal modules: `from authentication import get_auth_dependency`
- Import FastAPI dependencies with: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
- Import Llama Stack client with: `from llama_stack_client import AsyncLlamaStackClient`
- Check constants.py for shared constants before defining new ones
- All modules start with descriptive docstrings explaining purpose
- Use `logger = get_logger(__name__)` from log.py for module logging
- Type aliases defined at module level for clarity
- Use Final[type] as type hint for all constants
- All functions require docstrings with brief descriptions
- Complete type annotations for parameters and return types in functions
- Use `typing_extensions.Self` for model validators in Pydantic models
- Use modern union type syntax `str | int` instead of `Union[str, int]`
- Use `Optional[Type]` for optional type hints
- Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
- Avoid in-place parameter modification anti-patterns; return new data structures instead
- Use `async def` for I/O operations and external API calls
- Handle `APIConnectionError` from Llama Stack in error handling
- Use standard log levels with clear purposes: debug, info, warning, error
- All classes require descriptive docstrings explaining purpose
- Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
- Use ABC for abstract base classes with `@abstractmethod` decorators
- Use `@model_validator` and `@field_validator` for Pydantic model validation
- Complete type annotations for all class attributes; use specific types, not `Any`
- Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections
Files:
tests/e2e/features/steps/common.py, tests/e2e/features/environment.py
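To make the **/*.py conventions above concrete, here is a minimal module skeleton that follows several of them at once — the endpoint logic and the model-listing call are illustrative assumptions, not code from this repository:

"""Illustrative module showing the import, logging, and typing conventions above."""

from typing import Final

# Full dependency import per the guideline line; only HTTPException/status used here.
from fastapi import APIRouter, HTTPException, Request, status, Depends
from llama_stack_client import APIConnectionError, AsyncLlamaStackClient

from log import get_logger  # internal module, absolute import

logger = get_logger(__name__)

DEFAULT_TIMEOUT_SECONDS: Final[int] = 30  # constants carry Final[...] type hints


async def get_model_identifiers(client: AsyncLlamaStackClient) -> list[str]:
    """Fetch model identifiers from Llama Stack.

    Parameters:
        client: Configured async Llama Stack client.

    Returns:
        Identifiers of the models the stack reports.

    Raises:
        HTTPException: If Llama Stack is unreachable.
    """
    try:
        models = await client.models.list()
    except APIConnectionError as exc:  # guideline: handle APIConnectionError
        logger.error("Llama Stack unreachable: %s", exc)
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail="Llama Stack connection failed",
        ) from exc
    return [model.identifier for model in models]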
tests/e2e/**/*.{py,feature}
📄 CodeRabbit inference engine (AGENTS.md)
Use behave (BDD) framework for end-to-end testing
Files:
tests/e2e/features/steps/common.py, tests/e2e/features/environment.py
🧠 Learnings (3)
📚 Learning: 2026-04-13T13:39:54.963Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1490
File: tests/e2e/features/environment.py:206-211
Timestamp: 2026-04-13T13:39:54.963Z
Learning: In lightspeed-stack E2E tests under tests/e2e/features, it is intentional to set context.feature_config inside Background/step functions (scenario-scoped Behave layer). The environment.py after_scenario restore logic should only restore configuration when context.scenario_lightspeed_override_active is True; this flag is set by configure_service only when a real config switch occurs (so restore does not run for scenarios without a switch). Additionally, steps/common.py’s module-level _active_lightspeed_stack_config_basename is used to prevent re-applying the same config across subsequent scenarios, ensuring scenario_lightspeed_override_active stays False after the first apply. Therefore, reviewers should not “fix” this flow as if feature_config were incorrectly scoped or if after_scenario restoration is missing—config switching and restoration are meant to happen exactly once per actual switch, not redundantly per scenario.
Applied to files:
tests/e2e/features/steps/common.py, tests/e2e/features/environment.py
📚 Learning: 2026-04-07T09:20:26.590Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1467
File: tests/e2e/features/steps/common.py:36-49
Timestamp: 2026-04-07T09:20:26.590Z
Learning: For Behave-based Python tests, rely on Behave’s Context layered stack for attribute lifecycle: Behave pushes a new Context layer when entering feature scope (before_feature) and again for scenario scope (before_scenario). Attributes assigned inside given/when/then steps live on the current scenario layer and are automatically removed when the scenario ends. As a result, step-set attributes should not be expected to persist across scenarios or features, and manual cleanup in after_scenario/after_feature is generally unnecessary for attributes set in step functions. Only perform manual cleanup for attributes that you set explicitly in before_feature/before_scenario, since those live on the respective feature/scenario layers.
Applied to files:
tests/e2e/features/steps/common.py, tests/e2e/features/environment.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to tests/e2e/**/*.{py,feature} : Use behave (BDD) framework for end-to-end testing
Applied to files:
tests/e2e/features/environment.py
🔇 Additional comments (1)
tests/e2e/features/steps/common.py (1)
24-41: Persisting the Llama endpoint outside context is the right fix. This keeps the teardown inputs alive after Behave drops scenario-scoped attributes from context. Based on learnings, step-set Behave context attributes are cleared after each scenario, so teardown data that must survive to after_feature cannot live only on context.
Also applies to: 63-64
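A minimal sketch of that module-level pattern, using the getter/reset names that appear in the teardown snippet further below — the setter and the flag layout are assumptions, not the PR's exact code:

# tests/e2e/features/steps/common.py — sketch of teardown state that survives
# Behave's per-scenario context cleanup; only the getter/reset names are taken
# from this PR, the rest is illustrative.
_llama_stack_was_running: bool = False


def set_llama_stack_was_running() -> None:
    """Record that llama-stack was up before a disruption scenario (assumed name)."""
    global _llama_stack_was_running
    _llama_stack_was_running = True


def get_llama_stack_was_running() -> bool:
    """Report whether teardown needs to restore llama-stack."""
    return _llama_stack_was_running


def reset_llama_stack_was_running() -> None:
    """Clear the flag once teardown has restored llama-stack."""
    global _llama_stack_was_running
    _llama_stack_was_running = False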
| print("[before_scenario] Port-forward appears dead, restarting...") | ||
| try: | ||
| result = run_e2e_ops("restart-port-forward", timeout=60) | ||
| print(result.stdout, end="") | ||
| if result.returncode == 0: | ||
| print("[before_scenario] Port-forward re-established") | ||
| return | ||
| print(result.stderr, end="") | ||
| except subprocess.TimeoutExpired: | ||
| pass | ||
|
|
||
| # Port-forward alone failed — the pod itself may be dead (e.g. Llama Stack | ||
| # was never restored after a disruption feature). Attempt a full restart, | ||
| # which also checks Llama health before recreating LCS. | ||
| print("[before_scenario] Port-forward failed; attempting full pod restart...") | ||
| try: | ||
| result = run_e2e_ops("restart-lightspeed", timeout=200) | ||
| print(result.stdout, end="") | ||
| if result.returncode != 0: | ||
| print(result.stderr, end="") | ||
| print("[before_scenario] Warning: full pod restart failed") | ||
| else: | ||
| print("[before_scenario] Pod restart + port-forward re-established") | ||
| except subprocess.TimeoutExpired: | ||
| print("[before_scenario] Warning: full pod restart timed out") |
Don’t continue the scenario after both port-forward recovery paths fail.
When Line 161 and Line 175 both fail or time out, this hook only logs and returns. The scenario then runs against a known-dead endpoint and fails later in unrelated steps, which makes CI triage much harder. Re-probe readiness and raise here if recovery did not succeed.
💡 Suggested hard-fail after unsuccessful recovery
 except subprocess.TimeoutExpired:
     print("[before_scenario] Warning: full pod restart timed out")
+
+try:
+    resp = requests.get(url, timeout=5)
+    if resp.status_code not in (200, 401):
+        raise AssertionError(
+            f"Lightspeed readiness check still failing after recovery: {resp.status_code}"
+        )
+except requests.RequestException as exc:
+    raise AssertionError(
+        "Lightspeed port-forward could not be recovered before the scenario"
+    ) from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e/features/environment.py` around lines 159 - 183, the hook currently
logs failures from both run_e2e_ops("restart-port-forward") and
run_e2e_ops("restart-lightspeed") but continues; change before_scenario so that
after both recovery attempts it re-probes the service readiness (e.g., by calling
the existing readiness check or run_e2e_ops("check-readiness") or similar), and
if that probe still fails or any restart timed out or returned non-zero, raises
an exception (or calls sys.exit) to hard-fail the scenario; update the blocks
around run_e2e_ops to capture timeouts and non-zero return codes and, if recovery
did not succeed, raise a RuntimeError with a clear message so the scenario stops
immediately.
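For reference, a self-contained sketch of the hardened hook this prompt asks for — run_e2e_ops is the existing helper in this test suite (assumed to return a subprocess.CompletedProcess), and READINESS_URL is a hypothetical stand-in for the port-forwarded endpoint the existing readiness check uses:

# Sketch only — not the PR's code. Assumes run_e2e_ops() is importable from
# the surrounding test helpers and returns subprocess.CompletedProcess.
import subprocess
import requests

READINESS_URL = "http://localhost:8080/v1/models"  # hypothetical stand-in


def ensure_endpoint_recovered() -> None:
    """Attempt port-forward recovery, then a full restart; raise if both fail."""
    for op, timeout in (("restart-port-forward", 60), ("restart-lightspeed", 200)):
        try:
            result = run_e2e_ops(op, timeout=timeout)
            if result.returncode == 0:
                break  # restart reported success; still re-probe below
        except subprocess.TimeoutExpired:
            continue
    # Re-probe readiness regardless of what the restart commands reported.
    try:
        resp = requests.get(READINESS_URL, timeout=5)
        if resp.status_code in (200, 401):  # 401 still proves the app is up
            return
        detail = f"status {resp.status_code}"
    except requests.RequestException as exc:
        detail = repr(exc)
    raise RuntimeError(
        f"Lightspeed endpoint not recovered before scenario ({detail}); "
        "hard-failing instead of running against a dead port-forward."
    )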
# Restore Llama Stack FIRST (before any lightspeed-stack restart).
# Read from module-level state — Behave clears custom context attributes
# between scenarios, so context.llama_stack_was_running is unreliable here.
if get_llama_stack_was_running():
    _restore_llama_stack()
    reset_llama_stack_was_running()
Keep the library-mode guard around Llama Stack restoration.
Line 449 now restores whenever get_llama_stack_was_running() is true, but this file explicitly treats library mode as having no separate llama-stack runtime. Dropping the not context.is_library_mode check can make feature teardown try to restore a container/pod that should not exist in library runs.
💡 Suggested guard restoration
-if get_llama_stack_was_running():
+if not context.is_library_mode and get_llama_stack_was_running():
     _restore_llama_stack()
     reset_llama_stack_was_running()
📝 Committable suggestion
reset_llama_stack_was_running()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 # Restore Llama Stack FIRST (before any lightspeed-stack restart).
 # Read from module-level state — Behave clears custom context attributes
 # between scenarios, so context.llama_stack_was_running is unreliable here.
-if get_llama_stack_was_running():
+if not context.is_library_mode and get_llama_stack_was_running():
     _restore_llama_stack()
     reset_llama_stack_was_running()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e/features/environment.py` around lines 446 - 451, the restore logic
now runs unconditionally when get_llama_stack_was_running() is true, which can
attempt to restore a llama-stack in library-mode runs; update the condition to
also check that we're not in library mode by reintroducing the
context.is_library_mode guard (i.e., only call _restore_llama_stack() and
reset_llama_stack_was_running() when get_llama_stack_was_running() is true AND
context.is_library_mode is False), locating the check around
get_llama_stack_was_running(), _restore_llama_stack, and
reset_llama_stack_was_running to ensure teardown skips llama-stack restoration
in library-mode scenarios.
Description
Fixes the RHOAI inference provider Prow e2e pipeline, which was failing due to multiple cascading issues: RHOAI image version bump, in-cluster image build errors, namespace handling, port-forward instability, secret ordering, Llama Stack disruption cascade, and missing RAG/enrichment configuration.
Key changes:
- llama-stack-prow.yaml manifest with enrichment + RAG restore init containers
- pipeline.sh: in-cluster image build, bootstrap ordering, namespace hardcoding, secret exports, config paths, model/provider override env vars
- e2e-ops.sh: restart failures, mock-jwks port-forward, configmap cascade, port-forward resilience
- verify_connectivity for auth-enabled Prow environments
- DataScienceCluster CR
Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
Verify here
Summary by CodeRabbit
Improvements
Tests